Abstract

In this exercise we analyse the correlation between HDI and gender (in)equality. Our results indicate a strong correlation between the main determinants of the GII and HDI. We cannot prove causation, but merely suggest that gender equality should be considered when trying to improve HDI.


About the website

This is my final assignment for the University of Helsinki course Introduction to Open Data Science (IODS).

Code repository

The R code used is stored and available for inspection in this GitHub repository.

Extra

Instead of following the instructions and making the code panels foldable, I decided to create my own code wrapper, with some panels foldable and others always visible. If you find my solution too hard to evaluate, I’m happy to deliver the exercise as per your instructions.


Load data

This dataset contains 155 observations and 7 variables exploring the dependency between proxy variables of human development and gender (in)equality by country. The data originate from http://hdr.undp.org/en/data. Details of the data wrangling can be found here.

HDI attempts to describe quality of life – or how developed a country is – and comprises life expectancy, education and GNI per capita.
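As a quick sketch of how the index is built: since 2010, the HDI has been the geometric mean of the three sub-indices (the sub-index values below are made-up illustrations, not data from this study).

```r
# HDI = geometric mean of the life-expectancy, education and income indices.
life_idx   <- 0.85   # made-up life expectancy index
edu_idx    <- 0.70   # made-up education index
income_idx <- 0.75   # made-up GNI (income) index

hdi <- (life_idx * edu_idx * income_idx)^(1/3)
round(hdi, 3)   # 0.764
```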

Female labour participation, education, reproductive health and parliamentary representation determine another developmental index, the Gender Inequality Index (GII).

In this exercise, I explore whether the determinants of the GII correlate with HDI. My hypothesis is that countries with a low GII have a better HDI. It will be interesting to see which of the determinants of the GII correlate most strongly with HDI.

Study variables

Variable Description
HDI Human development index
GII Gender inequality index
matMort Maternal mortality ratio
adolBirthRate Adolescent birth rate
reprParl Percentage of female representatives in parliament
eduRatio Ratio between edu2F and edu2M
labRatio Ratio between labF and labM

Exploratory Data Analysis

Numerical overview

#Hidden
df %>% select(-country) %>% summaryKable() %>% 
  kable("html", align = "rrr", caption = "Study variable summary") %>%
  kable_styling(bootstrap_options = c("hover", "condensed"))
Study variable summary
HDI GII matMort adolBirthRate reprParl eduRatio labRatio
Min 0.348 0.016 1.000 0.600 0.000 0.172 0.186
1st Q 0.593 0.184 11.500 12.650 12.400 0.726 0.598
Median 0.733 0.385 49.000 33.600 19.300 0.938 0.753
Mean 0.704 0.366 149.084 47.159 20.912 0.853 0.707
3rd Q 0.829 0.524 190.000 71.950 27.950 0.997 0.854
Max 0.944 0.744 1100.000 204.800 57.500 1.497 1.038

These values display no alarming features. Interestingly, the median of eduRatio is ~1, which I would have expected to be much smaller.

Graphical overview

#Hidden
df %>% gather(key="var", value = "value", HDI:labRatio) %>%
  mutate(var=ordered(var, names(df[2:8]))) %>%  #To retain order when facetting
  ggplot() +
  geom_density(aes(value)) + 
  facet_wrap(~var, scales="free") +
  labs(title = "Study variable distribution", 
       y = "Density", x="Value") +
  theme(axis.text.x = element_text(angle = 90))

Most variables are distributed normally enough for my purposes. Maternal mortality is a bit skewed and has a long tail. A logarithmic transformation should make it more normal-like.

#Hidden
# Save a plot of non-normally distributed variables
p1 <- df %>% 
  ggplot() +
  geom_density(aes(matMort)) + 
  labs(title = "Before log-conv", 
       y = "Density", x="Value") 

#Log conversion for non-normal values
df %<>% mutate(matMort = log10(matMort)) %>% 
  rename(matMort_log = matMort)

# Store a plot of new values
p2 <- df %>% 
  ggplot() +
  geom_density(aes(matMort_log)) + 
  labs(title = "After log-conv", 
       y = "Density", x="Value") 

#Show plots
multiplot(p1,p2, cols = 2)

Clearly the distribution is better after the log-conversion.

#Hidden
p <- df %>% select(-country) %>% ggpairs(
  title = "Study variable overview",
  upper = list(continuous = wrap("cor", size = 4, color = "black")),
  lower = list(
    continuous = wrap("points", alpha = .2, size = .4),
    combo = wrap("facethist", bins = 20))) +
  theme(axis.text.x = element_text(
                  angle = 90,
                  color = "black",
                  size = 7,
                  vjust = .5),
      axis.text.y = element_text(color = "black", size = 7),
      strip.text = element_text(size = 8))

p[3,1] <-  p[3,1] + aes(color = "red")
p[4,1] <-  p[4,1] + aes(color = "red")
p[6,1] <-  p[6,1] + aes(color = "red")

p[1,3] <- p[1,3] + aes(fontface = 2)
p[1,4] <- p[1,4] + aes(fontface = 2)
p[1,6] <-  p[1,6] + aes(fontface = 2)

p

This graph contains a wealth of information and is worth spending time on. From the scatterplots, we can conclude that many of the variables correlate with each other and, importantly, that no obvious non-linear relationships emerge from the data.

Both maternal mortality and adolescent birth rate correlate strongly and negatively with HDI.

EduRatio exhibits a moderate positive correlation (R = 0.68). The relationship is not purely linear: for most of the HDI spectrum there is no correlation, but below a certain threshold a positive slope is visible (i.e. HDI correlates positively with eduRatio).

The gender inequality index itself correlates moderately and negatively with HDI (and, in fact, with all components of HDI; not shown).
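The threshold idea above can be sketched numerically: split the data at a candidate eduRatio cutoff and compare the correlations on each side. The simulated data and the 0.9 cutoff below are assumptions for illustration, not values from this analysis.

```r
# Toy sketch of a threshold effect: below the cutoff, HDI rises with
# eduRatio; above it, the relationship flattens out.
set.seed(1)
edu <- runif(200, 0.2, 1.5)   # simulated eduRatio values
hdi <- ifelse(edu < 0.9, 0.4 + 0.3 * edu, 0.67) + rnorm(200, 0, 0.03)

below <- cor(hdi[edu < 0.9],  edu[edu < 0.9])    # strong positive correlation
above <- cor(hdi[edu >= 0.9], edu[edu >= 0.9])   # correlation fades
round(c(below = below, above = above), 2)
```

In the real data, the cutoff would be chosen by inspecting the scatterplot (or fitting a proper breakpoint model) rather than fixed in advance.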


Wrangling eduRatio

Reasoning

Earlier I stated that there seems to be some kind of plateau in the correlation between HDI and education ratio (i.e. beyond a certain threshold, the correlation between the variables fades). My guess is that countries with low education also have high inequality, and that reducing this inter-gender difference only helps up to a point.

Binning education ratio

#Hidden
df %<>% mutate(eduRatioT = ntile(df$eduRatio, 5))
p1 <- ggplot(df) + 
  geom_point(aes(eduRatio, HDI, color = as.factor(eduRatioT))) +
  scale_color_discrete(guide = FALSE)

df %<>% mutate(eduIneq = abs(1-eduRatio))
p2 <- ggplot(df) + 
  geom_point(aes(eduIneq, HDI, color = as.factor(eduRatioT))) +
  scale_color_discrete(name = "EduRatio \n quintile") +
  theme(legend.position = c(.9, .85))

multiplot(p1, p2, cols = 2)

Based on this plot, my initial assumption wasn’t entirely correct. Yes, HDI increases with eduRatio, but it seems to peak when there is no difference between the genders, and to decrease after that.


Linear regression

We start by analyzing the proposed correlations between HDI and maternal mortality, adolescent birth rate, and education ratio.

First model

#Hidden

# New model
model <- lm(HDI ~ matMort_log + adolBirthRate + eduRatio, 
             data = df)

## Regression plots
par(mfrow = c(2,2), oma = c(0, 0, 2, 0), 
    mar = c(2.5,3,2,0.5), 
    mgp = c(1.5,.5,0))


plot(model, which = c(1,2), add.smooth = T)

norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun. The following code can be used to check the result
# sum(norm.res != rstandard(model))

aa <- df$HDI
leverage <- (aa-mean(aa))^2/sum((aa-mean(aa))^2)+1/length(aa)

plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")

Based on Cook’s distance, our model is burdened by an influential outlier. The observation in question is Myanmar, which is excluded from further analysis.
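Naming the observation with the largest Cook’s distance is a one-liner; here is a sketch on built-in data, where `mtcars` and its row names stand in for the actual dataset (in the analysis, `df$country` would supply the names).

```r
# Sketch: identify the most influential observation by Cook's distance.
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model on built-in data
cd  <- cooks.distance(fit)
rownames(mtcars)[which.max(cd)]           # observation with largest influence
```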

Second model

#Hidden

# Remove the outlier
df <- df[-tail(order(cooks.distance(model)),1),]

# New model
model <- lm(HDI ~ matMort_log + adolBirthRate + eduRatio, 
             data = df)

## Regression plots
par(mfrow = c(2,2), oma = c(0, 0, 2, 0), 
    mar = c(2.5,3,2,0.5), 
    mgp = c(1.5,.5,0))


plot(model, which = c(1,2), add.smooth = T)

norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun. The following code can be used to check the result
# sum(norm.res != rstandard(model))

aa <- df$HDI
leverage <- (aa-mean(aa))^2/sum((aa-mean(aa))^2)+1/length(aa)

plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")

There appears to be one more outlier with high influence. This time it is Belarus, which is excluded from further analysis.

Final model

#Hidden

# Remove the outlier
df <- df[-tail(order(cooks.distance(model)),1),]

# New model
model <- lm(HDI ~ matMort_log + adolBirthRate + eduRatio, 
             data = df)

## Regression plots
par(mfrow = c(2,2), oma = c(0, 0, 2, 0), 
    mar = c(2.5,3,2,0.5), 
    mgp = c(1.5,.5,0))


plot(model, which = c(1,2), add.smooth = T)

norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Computed the standardized residuals the long way for fun. The following code can be used to check the result
# sum(norm.res != rstandard(model))

aa <- df$HDI
leverage <- (aa-mean(aa))^2/sum((aa-mean(aa))^2)+1/length(aa)

plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")
plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")

According to the diagnostic plots, this model has no critical errors:
1. Residuals are the differences between the fitted and the actual values. In this plot we see no clustering or other patterns that could indicate problems with the model; the variance of the errors is constant.
2. The Q-Q plot demonstrates that the model performs well at median values, but starts to accumulate error at low values.
3. In this model, no observations have high leverage.
4. No single observation affects the model too much.

Summary of the final model

summary(model)
## 
## Call:
## lm(formula = HDI ~ matMort_log + adolBirthRate + eduRatio, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.161452 -0.033929  0.002323  0.035044  0.138537 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.8222244  0.0273297  30.085  < 2e-16 ***
## matMort_log   -0.1490740  0.0103201 -14.445  < 2e-16 ***
## adolBirthRate -0.0003674  0.0001681  -2.185   0.0304 *  
## eduRatio       0.1807900  0.0212869   8.493  1.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05018 on 149 degrees of freedom
## Multiple R-squared:  0.8974, Adjusted R-squared:  0.8953 
## F-statistic: 434.5 on 3 and 149 DF,  p-value: < 2.2e-16

The model explains ~90% of the variance in HDI. All our explanatory variables correlate significantly with HDI, although adolescent birth rate only barely. This weak effect is probably due to collinearity between maternal mortality and adolescent birth rate.
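The collinearity suspicion could be checked with variance inflation factors. Below is a minimal sketch that computes VIFs by hand (no `car` package) on simulated predictors; the data, and the 0.9 loading mimicking the matMort–adolBirthRate correlation, are assumptions for illustration.

```r
# VIF of predictor j = 1 / (1 - R^2) from regressing it on the others.
set.seed(42)
x1 <- rnorm(150)                        # stands in for matMort_log
x2 <- 0.9 * x1 + rnorm(150, sd = 0.3)   # collinear, like adolBirthRate
x3 <- rnorm(150)                        # independent, like eduRatio
X  <- cbind(x1, x2, x3)

vif <- function(j, X) 1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared)
round(sapply(1:3, vif, X = X), 2)       # x1 and x2 inflated, x3 near 1
```

A rule of thumb is that VIFs above ~5–10 signal problematic collinearity.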

It would be most interesting to explore why Belarus and Myanmar were such strong outliers in this model. My gut feeling is that Myanmar has fairly low rates of maternal mortality and adolescent birth rate, and that its HDI is lagging behind rapid changes in these values. For Belarus, I would suppose that it has a relatively high HDI compared to its gender differences in education. These are merely my personal ponderings, and I will not explore these outliers more deeply.


Linear discriminant analysis

Next, I wanted to explore linear discriminant analysis (LDA), a form of dimensionality reduction.

LDA

First, the dependent variable, HDI, is divided into 4 equally sized categories. All other variables are scaled to have zero mean and unit variance.

#Hidden
# Explanatory variables are scaled
df %<>% mutate_all(funs(scale))
# sapply(df, sd)

# HDI is divided into quartiles
df$HDI <- ntile(df$HDI, 4) %>% 
  factor(labels = c("low", "med-lo", "med-hi", "high"))

# The data is sampled into training and testing sets
sample_ind <- sample(nrow(df),  size = nrow(df) * 0.8)
train <- df[sample_ind,]
test <-  df[-sample_ind,]
#data.frame(dim(train),dim(test))

#LDA
lda.fit <- lda(HDI ~ ., data = train)
lda.fit
## Call:
## lda(HDI ~ ., data = train)
## 
## Prior probabilities of groups:
##       low    med-lo    med-hi      high 
## 0.2786885 0.2213115 0.2704918 0.2295082 
## 
## Group means:
##        matMort_log adolBirthRate    reprParl   labRatio    eduIneq
## low      1.2678408    1.15206192 -0.14125200  0.3672907  1.3033668
## med-lo   0.2649995   -0.06024558 -0.08287924 -0.2742146 -0.2148443
## med-hi  -0.4002100   -0.39092174 -0.21437236 -0.2049382 -0.5017317
## high    -1.2842240   -0.88781541  0.44235476  0.3207093 -0.6437285
## 
## Coefficients of linear discriminants:
##                       LD1        LD2           LD3
## matMort_log    2.08351901 -1.2848210  7.566608e-01
## adolBirthRate -0.16044762  0.3834058 -7.377454e-01
## reprParl      -0.07514800  0.1977839  8.649420e-01
## labRatio       0.06350302  0.6644102 -3.612129e-05
## eduIneq        0.67966255  1.1817252 -1.030491e-02
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9235 0.0734 0.0030

Almost all of the between-group variation is captured by LD1 (92% of the trace), which is driven mainly by maternal mortality and education inequality.

Visualising the model

#Hidden
library(plotly, warn.conflicts = F)

# Plot the lda results
points <- data.frame(HDI = train$HDI,
                     lda = predict(lda.fit)$x)
levels(points$HDI) %<>%  str_to_title()

arrows <- coef(lda.fit) %>% 
  data.frame(., label = rownames(.)) %>% arrange(desc(abs(LD1))) %>% 
  mutate(LD1 = LD1*2.5, LD2 = LD2*2.5, LD3 = LD3*2.5, pos = 1) %>% 
  rbind(., mutate(., LD1=0, LD2=0, LD3=0, pos =0)) 


p1 <- plot_ly(arrows, x = ~LD1, y = ~LD2, z = ~LD3, 
  type = "scatter3d" , color = ~label, colors = rep(rgb(0, 0, 0), 13),
  opacity = .5, mode = "lines", hoverinfo = "name", showlegend = FALSE, 
  line = list(width = 5))

p2 <- plot_ly(points, x = ~lda.LD1, y = ~lda.LD2, z = ~lda.LD3, 
    type = "scatter3d" , color = ~HDI, opacity = .75, hoverinfo = "none",
    mode = "markers", marker = list(size = 3, width = 2)) %>% 
  layout(title = "LDA",
       scene = list(xaxis = list(title = "LDA1"),
                    yaxis = list(title = "LDA2"),
                    zaxis = list(title = "LDA3")))

subplot(p1, p2)

Testing the model

table("HDI" = test$HDI, 
      "Prediction" = predict(lda.fit, newdata = test)$class)
##         Prediction
## HDI      low med-lo med-hi high
##   low      5      0      0    0
##   med-lo   0     10      1    0
##   med-hi   0      2      3    0
##   high     0      1      2    7

The model works very well: it correctly categorizes 25/31 observations in the test set.
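The accuracy quoted above follows directly from the confusion matrix; as a quick check (the matrix is copied from the printed table, not recomputed):

```r
# Accuracy = correct predictions (diagonal) / all predictions.
cm <- matrix(c(5,  0, 0, 0,
               0, 10, 1, 0,
               0,  2, 3, 0,
               0,  1, 2, 7),
             nrow = 4, byrow = TRUE)
sum(diag(cm)) / sum(cm)   # 25/31 ≈ 0.806
```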


Discussion

In this study, we explored whether the determinants of the GII correlate with HDI. Our analysis concludes that education inequality and, more importantly, maternal mortality correlate strongly with HDI. With our linear regression model, we could explain nearly 90% of the variance in HDI with just maternal mortality, adolescent birth rate and education inequality. The same trend was clearly visible in the LDA.

Interestingly, we identified a couple of outliers (Belarus and Myanmar), both of which merit further study.

Based on these findings, which are in line with our current understanding of developmental drivers, improving gender equality should help raise HDI. These results should be considered when deciding on interventions for raising HDI.